Abstract

An optimization technique is used to determine the pairwise interactions between amino acids in globular proteins. A numerical strategy is applied to a set of proteins for maximizing the native fold stability with respect to alternative structures obtained by gapless threading. The extracted parameters are shown to be very reliable for identifying the native states of proteins (unrelated to those in the training set) among thousands of conformations. The only poor performers are proteins with heme groups and/or poor compactness whose complexity cannot be captured by standard pairwise energy functionals.

1 Introduction

A knowledge of the interaction potentials between amino acids is of crucial importance both for predicting the three-dimensional structure of a protein's native state and for designing novel proteins, folding on a desired target conformation (Bauer & Beyer 1994, Bowie, Lüthy & Eisenberg 1991, Bryant & Lawrence 1993, Casari & Sippl 1992, Godzik, Kolinsky & Skolnick 1992, Goldstein, Luthey-Schulten & Wolynes 1992, Socci & Onuchic 1994, Huang, Subbiah & Levitt 1995, Jones, Taylor & Thornton 1992, Ouzounis, Sander, Scharf & Schneider 1993, Bowie & Eisenberg 1994, Kolinsky & Skolnick 1994, Dandekar & Argos 1994, Levitt 1976, Levitt 1982, Skolnick, Kolinsky, Brooks, Godzik & Rey 1993, Sun 1993, Wallqvist & Ullner 1994, Deutsch & Kurosky 1996, Srinivasan & Rose 1995, Micheletti, Seno, Maritan & Banavar 1998, Seno, Micheletti, Maritan & Banavar 1998, Creighton 1993, Branden & Tooze 1991, Anfinsen 1973). This builds on the assumption that the interactions between amino acids (and the solvent) are principally responsible for driving the folding of a protein to its native state. This is supported by considerable experimental evidence that the native states of many globular proteins correspond to free energy minima (Anfinsen 1973, Wolynes, Onuchic & Thirumalai 1994, Creighton 1993, Branden & Tooze 1991).



On a microscopic scale, all-atom potentials are used to carry out "first principle" molecular dynamics for folding (van Gunsteren 1989). Due to the high level of detail included in such calculations, the folding processes of short peptides can be followed only for rather short time-scales (of the order of 1 μs as in Duan & Kollman (1998)). While the impact of such "ab initio" calculations is destined to grow rapidly, at present, highly satisfactory results can be obtained by adopting mesoscopic phenomenological approaches. Within this framework, a commonly used approach is to avoid a detailed description of an amino acid but represent it as a sphere or an ellipsoid centered on the Cα or Cβ position (Maiorov & Crippen 1992, Srinivasan & Rose 1995, Kolinsky, Godzik & Skolnick 1993, Sun, Brem, Chan & Dill 1995, Micheletti et al. 1998). This coarse-grained procedure amounts to integrating out the fine degrees of freedom of a peptide chain and introduces effective interactions between the surviving degrees of freedom.

One commonly used strategy to extract coarse grained potentials between pairs of amino acids has been proposed by Miyazawa & Jernigan (1985). The method is based on the quasichemical approximation and it entails the calculation of pairing frequencies of amino acids observed in native structures of naturally occurring proteins. Similar approaches have been reviewed by Sippl (1995) and Wodak & Rooman (1993). Thomas & Dill (1996) have recently tested the validity of this procedure on exactly solvable lattice models for proteins. In all the cases they considered the extracted potentials did not correlate too well with the true potentials, although the two sets shared a common trend.

A different strategy for extracting potentials was suggested by Maiorov & Crippen (1992) and recently an optimized version has been introduced (van Mourik, Clementi, Maritan, Seno & Banavar 1998, Seno, Maritan & Banavar 1998). Rigorous tests, similar to the ones in Thomas & Dill (1996), carried out both for lattice and off-lattice models have shown that the optimized strategy converges to the exact potentials for increasing chain length and/or number of proteins in the training set. The method, explained in detail in Section 4.2, uses the following basic ingredients: the potentials parametrizing a suitably chosen Hamiltonian must be such that the energy of a protein sequence in its own native state is lower than in any other alternative conformation that the protein can attain. For each sequence this yields a set of linear inequalities involving the unknown interaction potentials. Two key points need to be addressed carefully when applying this procedure: the parametrization of the Hamiltonian and the generation of alternative conformations. If the parametrization of the energy is too poor and/or there are unphysical conformations among the decoys (i.e. violating steric constraints), then no consistent solution can be found (unlearnable problem, for an example see van Mourik et al. (1998) and Vendruscolo, Najmanovich & Domany (1999)). On the other hand, if the parametrization is reliable and there are no unphysical decoy conformations, the energy parameters satisfying the inequalities lie in a convex region of parameter space. While all points inside the cell satisfy the whole set of inequalities, there is an optimal point, typically equidistant from the hyperplanes bounding the cell. The potential parameters corresponding to the optimal point ensure that the native states of proteins are maximally stable with respect to alternative structures. Our strategy aims at pinpointing the optimal solution, while the original Maiorov and Crippen strategy stopped when reaching an unspecified sub-optimal point inside the cell. Our approach differs from the one employed in (Maiorov & Crippen 1992) also because of the different interaction matrix: in our scheme (as in Miyazawa & Jernigan (1985) or Kolinsky et al. (1993)) the interaction energy of amino acid pairs does not depend on their sequence separation, while a complementary strategy was followed by Maiorov and Crippen. In the next Section we introduce the coarse grained model for proteins and give an overview of the optimal potential extraction technique. The latter is discussed in detail in Section 4. An assessment of the performance of extracted potentials, and a comparison with previously known interactions, are given in Section 3.



2 Theory 

2.1 The Model 

We choose not to introduce any subdivision of amino acids in classes and retain the full repertoire of 20 types. As is customary, we used a simplified representation of protein structures and replaced amino acids with a centroid placed at the Cβ position (Srinivasan & Rose 1995). A fictitious Cβ was constructed for glycine and for amino acid entries without it, by using standard rotamer angles following Park & Levitt (1996).

The basic assumption is that the stable structure of a protein is determined by several factors, that can be ultimately reduced, through an averaging process, to effective contact interactions between amino acids. Thus, we postulate the existence of a functional of the contacts between protein residues, which is in correspondence with the protein energy. The values attained by such a functional should relate to the degree of stability of the conformations housing the sequence.

The strength of a contact between two amino acids whose Cβs are at positions $\mathbf{r}_1$ and $\mathbf{r}_2$ is defined according to the following form, which is a smooth approximation to a stepwise contact function with cutoff at 8.0 Å:

$$\Delta(\mathbf{r}_1,\mathbf{r}_2) = \frac{1}{2}\tanh\!\left(\frac{8.0 - |\mathbf{r}_1-\mathbf{r}_2|}{2}\right) + 0.5 . \tag{1}$$

The smooth nature of $\Delta(\mathbf{r}_1,\mathbf{r}_2)$ ensures that our results are not very sensitive to the actual form of $\Delta$. For simplicity of notation, in the following, we will indicate contact maps with the symbol $\Gamma$.

Two Hamiltonian forms for the energy of a sequence $S$ on a structure $\Gamma$ were considered. First we adopted the following contact energy function:

$$E(S,\Gamma) = \sum_{i>j+1}^{N} \epsilon(A_i,A_j)\,\Delta(\mathbf{r}_i,\mathbf{r}_j) \tag{2}$$

where the sum is over all pairs of non-consecutive residues, $N$ is the protein length and $A_i$ is the amino acid type (there are altogether 20 types) at $\mathbf{r} = \mathbf{r}_i$. $\epsilon$ is the 20 x 20 matrix of contact energies. Since $\epsilon$ is symmetric, there are only 210 distinct entries in the matrix. We also considered a second form with 20 additional terms related to the degree of solvation of amino acid types:

$$E(S,\Gamma) = \sum_{i>j+1}^{N} \epsilon(A_i,A_j)\,\Delta(\mathbf{r}_i,\mathbf{r}_j) + \sum_{i=1}^{N} \epsilon(0,A_i) \sum_{\substack{j=1 \\ j\neq i,\,i\pm 1}}^{N} \Delta(\mathbf{r}_i,\mathbf{r}_j) . \tag{3}$$

The very last sum in (3) corresponds to the total number of contacts of the $i$th residue and reflects its degree of burial. Accordingly, polar amino acids, typically residing at a protein's surface, are expected to have solvation parameters, $\epsilon(0,A)$, larger than the hydrophobic ones. Expression (3) is formally equivalent to (2) in that it can be rearranged to obtain a unique sum involving just 210 terms:

$$E(S,\Gamma) = \sum_{i>j+1}^{N} \left[\epsilon(A_i,A_j) + \epsilon(0,A_i) + \epsilon(0,A_j)\right] \Delta(\mathbf{r}_i,\mathbf{r}_j) . \tag{4}$$

Nevertheless, using our strategy to extract energy parameters, expressions (2) and (3) turn out not to be equivalent. In expression (3), the coefficients multiplying $\epsilon(0,A)$ are large with respect to those pertaining to the general $\epsilon(A,A')$ entries. The solvation term will accordingly give a significant contribution to the energy of a sequence. This feature was shown to be very useful to discriminate the native state of a protein from decoy structures (Park & Levitt 1996, Dahiyat & Mayo 1997). Furthermore, by using (3), it is possible to estimate the solvent-amino acid interaction, a procedure not carried out by Maiorov & Crippen (1992).

The interaction parameters appearing in eq. (2) and (3) are not completely independent since the energy scale can be fixed arbitrarily (see footnote 1). To remove this degree of freedom, we choose to set the norm of the vector describing the potentials to 1,

$$\sum_{A \le A'} \epsilon^2(A,A') = 1 . \tag{5}$$

1 In other potential extraction schemes, the potentials are shifted to make their average zero. A priori this may not be allowed, since the energy shift will typically affect the average protein solubility (Giugliarelli, Maritan, Micheletti & Banavar 199?).
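The two energy forms can be compared numerically. The following is a minimal sketch, assuming Cβ coordinates are given as an array and amino acid types as integers 0..19; the function names and the random test data are illustrative and not part of the original work. It evaluates the contact function (1), the energies (2) and (3), and the rearranged sum (4), which should agree with (3) for any choice of parameters.

```python
import numpy as np

def contact_map(coords, cutoff=8.0, width=2.0):
    """Smooth contact strengths of eq. (1): tanh((cutoff - d) / width) / 2 + 0.5."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return 0.5 * np.tanh((cutoff - d) / width) + 0.5

def energy_pair(types, delta, eps):
    """Eq. (2): sum over non-consecutive pairs i > j + 1."""
    total = 0.0
    for i in range(len(types)):
        for j in range(i - 1):          # j runs up to i - 2
            total += eps[types[i], types[j]] * delta[i, j]
    return total

def energy_solv(types, delta, eps, eps0):
    """Eq. (3): pair term plus solvation term weighted by residue burial."""
    n = len(types)
    total = energy_pair(types, delta, eps)
    for i in range(n):
        burial = sum(delta[i, j] for j in range(n) if abs(i - j) > 1)
        total += eps0[types[i]] * burial
    return total

def energy_rearranged(types, delta, eps, eps0):
    """Eq. (4): single sum with effective pair energies."""
    total = 0.0
    for i in range(len(types)):
        for j in range(i - 1):
            eff = eps[types[i], types[j]] + eps0[types[i]] + eps0[types[j]]
            total += eff * delta[i, j]
    return total

# Hypothetical random chain, only to exercise the functions.
rng = np.random.default_rng(0)
coords = rng.normal(size=(12, 3)) * 5.0
types = rng.integers(0, 20, size=12)
eps = rng.normal(size=(20, 20))
eps = (eps + eps.T) / 2.0               # symmetric contact matrix
eps0 = rng.normal(size=20)              # solvation parameters eps(0, A)
delta = contact_map(coords)
print(energy_solv(types, delta, eps, eps0), energy_rearranged(types, delta, eps, eps0))
```

The agreement of the two printed values only reflects the algebraic identity between (3) and (4); it says nothing about which parametrization performs better once the parameters are extracted.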



2.2 Optimal strategy 

The key prescription at the heart of the potential extraction scheme is that a protein sequence attains the lowest possible energy when mounted on its correct native state. Hence, assuming that the energy parametrizations (2) and (3) are reliable, the correct potentials will be such that the native state has the lowest energy when compared to alternative conformations.

The first step of the analysis was to compile a list of non-homologous proteins representing a variety of folds (see Section 4.1 for details). For each protein sequence in this training set, $S_i$ (with known native state $\Gamma_i$), the alternative structures are obtained by threading on conformations in the training set of equal or longer length (Jones et al. 1992). Thus, for the correct set of potentials:

$$E(S_i,\Gamma_i) < E(S_i,\Gamma_D) , \tag{6}$$

for all the decoy structures, $\Gamma_D$, obtained by threading. Therefore, for each sequence in the training set, one obtains an array of inequalities. Due to the finite number of proteins in the training set, the whole ensemble of inequalities will be satisfied by more than a single set of potentials. Indeed, there will typically be a whole region of points in parameter space each corresponding to a set of potentials consistent with inequalities (6). The optimal solution is attained by simultaneously maximizing the stability gap for all proteins in the set. The stability gap is defined as the smallest energy difference between a protein's native state and one of the decoy conformations. The optimal stability requirement implies that the following inequalities should hold simultaneously for each training protein

$$\frac{E(S_i,\Gamma_D) - E(S_i,\Gamma_i)}{f(D(\Gamma_D,\Gamma_i))} > c \qquad \forall\, \Gamma_D \tag{7}$$

where $c$ is a positive quantity to be made as large as possible, the $\Gamma_D$'s belong to the set of decoy conformations and the energy interactions satisfy the normalization (5).

The function $f$ in the denominator of (7) is a function of the structural distance between $\Gamma_D$ and $\Gamma_i$. This serves the purpose of making inequalities (7) more stringent when mounting $S_i$ on structurally dissimilar conformations. We used three different trial functions for $f$:

$$f(x) = 1 \tag{8}$$

$$f(x) = x \tag{9}$$

$$f(x) = x^2 \tag{10}$$

For the distance function $D(\Gamma,\Gamma')$, appearing in eq. (7), we used the Euclidean distance in contact-map space:

$$D(\Gamma,\Gamma') = \left[ \frac{\sum_{i>j+1}^{N} \left( \Delta(\mathbf{r}_i,\mathbf{r}_j) - \Delta(\mathbf{r}'_i,\mathbf{r}'_j) \right)^2}{(N-1)(N-2)/2} \right]^{1/2} . \tag{11}$$

$D(\Gamma,\Gamma')$ can be viewed as a close relative in terms of contact maps of the standard distance root mean square deviation (DRMSD) but related to our definition of the energy functional.

By threading the training sequences on longer structures, we generated the whole set of inequalities (6). Each of these identifies a hyperplane in parameter space dividing space into two semi-infinite regions, one of which is compatible with the inequality and contains the physical set of parameters (van Mourik et al. 1998). When more inequalities are used, the physical region containing the correct parameters reduces to the intersection of all physical half-spaces. Eventually, the region reduces to a small, convex cell (not necessarily closed) whose walls are determined by a number of inequalities of the order of the dimension of parameter space.

The optimal point in the cell is found by using a perceptron strategy, as described in Section 4.2. This procedure has been shown to converge to the true potential when used in exact models where rigorous tests are available (van Mourik et al. 1998, Clementi, Maritan & Banavar 1998). It is also possible that parametrizations (2) or (3) may not be sufficient to guarantee that a solution to inequalities (6) exists. Indeed, if the decoy structures are very competitive with the native structures, three or further body interactions might be necessary to solve inequalities (6) consistently (Vendruscolo et al. 1999).

This procedure differs significantly from the one of Maiorov & Crippen (1992) where the parameters were determined in a sub-optimal manner.
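To make the stability requirement concrete, the sketch below computes a contact-map distance and the smallest normalized energy gap for one sequence. It assumes that the distance is the root mean square difference of contact strengths over the (N-1)(N-2)/2 non-consecutive pairs, and the helper names `contact_distance` and `stability_gap` are hypothetical, introduced only for illustration.

```python
import numpy as np

def contact_distance(delta_a, delta_b):
    """Euclidean distance in contact-map space over pairs with i > j + 1."""
    n = delta_a.shape[0]
    i, j = np.tril_indices(n, k=-2)      # selects the (n-1)(n-2)/2 pairs i >= j + 2
    diff = delta_a[i, j] - delta_b[i, j]
    return float(np.sqrt(np.sum(diff ** 2) / ((n - 1) * (n - 2) / 2)))

def stability_gap(e_native, e_decoys, distances, f=lambda x: 1.0):
    """Smallest value of (E_decoy - E_native) / f(D) over all decoys, cf. eq. (7)."""
    return min((e_d - e_native) / f(d) for e_d, d in zip(e_decoys, distances))

# Illustrative numbers only: one native energy and two decoys.
print(stability_gap(-10.0, [-8.0, -5.0], [0.5, 1.0]))                    # f(x) = 1
print(stability_gap(-10.0, [-8.0, -5.0], [0.5, 1.0], f=lambda x: x))     # f(x) = x
```

Note that with f(x) = x or f(x) = x^2 a decoy at small structural distance yields a larger normalized gap for the same energy difference, so structurally distant decoys become the binding constraints.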

3 Results and discussion 



We succeeded in finding an optimal solution to the different systems of inequalities (7): the optimal parameters obtained for Hamiltonian (3) and f = 1 are given in Table 1.

We found that only a tiny fraction of all inequalities (7) determine the optimal stability solution (roughly 100 out of 1551196, depending on the f used and on whether the solvation term is present). It is important to ensure that the optimal solution does not fluctuate wildly when stringent inequalities are added or removed. To check this, we eliminated the 100 most stringent inequalities. Even though this completely replaces the walls of the physical cell, the new optimal solution differed only slightly from the first one: representing the parameters in a 230-dimensional vector space, the two vectors were only 15° apart (see footnote 2). Such a degree of correlation is significant because the expected angle between two uncorrelated vectors in a space of ≈ 200 dimensions is about 90° ± 4°. This gives confidence in the robustness of the procedure and the statistics of the training set.
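The reference value for uncorrelated vectors can be checked with a short simulation. This is a sketch under the assumption that "uncorrelated" means independent vectors with isotropic Gaussian components; the function name and sample size are illustrative.

```python
import numpy as np

def random_angles(dim, n_pairs, seed=0):
    """Angles in degrees between pairs of independent random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_pairs, dim))
    v = rng.normal(size=(n_pairs, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return np.degrees(np.arccos(cos))

angles = random_angles(dim=200, n_pairs=5000)
# The spread is expected to be close to degrees(1 / sqrt(dim)), about 4 degrees here.
print(angles.mean(), angles.std(), np.degrees(1.0 / np.sqrt(200)))
```

In high dimension the cosine of the angle has a standard deviation of about 1/sqrt(d), which for d = 200 corresponds to roughly 4°, so an angle of 15° is far outside the range expected by chance.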

The optimal parameters extracted with different trial forms of f in (7) were also closely correlated. As summarized in Table 2, their relative angle was always less than about 16°. On the contrary, sub-optimal vectors, for which inequalities (7) are satisfied for c ≈ 0 (in which case the detailed form of f is not relevant), form, on average, an angle of about 50° with the optimal solution. This fact underscores the importance of introducing an extremal criterion when maximizing

The extracted solvation parameters showed a good correlation (correlation coefficient 0.67) with the hydrophobicity scales as given by Creighton (1993). As shown in Fig. 4.2, the agreement is quite good except, perhaps, for proline. The discrepancy with proline finds a natural explanation within the scheme that we used. In fact, while the hydrophobicity scales in Fig. 4.2 relate to the propensities of individual, isolated amino acids, the solvation parameter also reflects their structural role in a peptide context. Because prolines are typically located in loop regions, they appear to have an effective hydrophilic propensity larger than their bare value.

Finally, we carried out a stringent validation of the extracted potentials by performing a blind ground-state recognition on a test set. The test set (see Table 3) comprised proteins taken from those used in Miyazawa & Jernigan (1996) and chosen so that they would meet some of the criteria used to select the training set (see Section 4.1). We deliberately introduced proteins with hetero groups, a low degree of compactness and also pairs with high structural homology. In all cases we ensured that no protein in the test set had a significant degree of structural homology with those in the training set.

We took, in turn, the sequences of the test set and threaded them on structures in the set with equal or longer length. We then checked whether, using the optimal potential parameters of Table 1, the true native state was recognized as the lowest energy one. Indeed, this turned out to be the case for all but 6 proteins. No higher success rate was found on using some other known sets of potentials consistent with the form of our Hamiltonian.

Another relevant quantity related to the performance of the algorithm is given by the number of violated inequalities of type (6) for the test set. This quantity shows a much higher degree of variability than the number of correctly identified ground states and is given in column 3 of Table 4.2. It can be seen that the optimal parameters extracted with the solvent and f = 1 perform far better than those without the solvent and previously extracted ones. It also appears that enforcing optimality provides a dramatic reduction of wrong inequalities compared to the sub-optimal cases. This provides a sound a posteriori justification for the optimal extraction procedure as well as giving confidence in the parameters.

2 We note that only the direction of the vector of parameters is important, because it sets the rank of the conformations that a sequence can assume, while the norm of that vector just sets an energy scale.

The few cases where the extracted potentials fail are due to one of the following situations: a) the native protein is not very compact or b) it contains stabilizing hetero groups. Situations in which a highly homologous structure has a lower energy score than the native one are not deemed as errors. A typical energy/structural distance plot is shown in Fig. ||. It is apparent that homologous structures have energies similar to that of native conformations, while distant structures lie higher in energy. Some differences in the performance were observed for the sets of 210 and 230 potentials. While the latter only fail to recognize native states containing heme groups etc., the former occasionally fail to recognize native states with no atypical feature (e.g. interleukin-4, 1rcb).

For proteins with heme groups, several decoy structures score better than the native one: they usually have fewer contacts, being less compact than the native state. This is possibly related to the presence of proline in an unusually buried position, namely the heme pocket. In fact, due to the high effective solvation term assigned to proline, the native structure is penalized with respect to decoys in which proline sits in more solvent-exposed positions.

An interesting case where the failure relates to a non-compact protein is given by trp aporepressor (3wrp), for which several better scoring decoys exist. The explanation lies in the fact that 3wrp is always found as a dimer: the side of the protein binding its counterpart has non-polar surface residues usually in contact with non-polar residues on the other monomer, which is not accounted for by our procedure.

Nevertheless, the algorithm appears to work in other instances of non-compact conformations such as troponin c (4tnc) and calmodulin (1cll) and on some cytochromes c such as 1ccr or 1yeb, showing that the optimization procedure succeeds in extracting a potential with a wider applicability range than that given by the folds used in the training set.

Out of 15 different pairs of homologous structures (contact map distance less than 0.1), the energy functional ranks the true native state as the lowest in just 8 cases. In the other cases the native state attains an energy value slightly higher than the homologous one. As expected, the simple contact potential cannot distinguish the native state among very similar structures, but it consistently assigns similar energies to similar conformations according to their degree of similarity (see Fig. ||).

It is important to note that there is a well-defined 
trend for the protein ground-state energies as a func- 
tion of protein length. Deviations from this typical 
trend could be used to assess the reliability of the 
predicted fold of a sequence with unknown structure 
(Fig. §. 



4 Methods 

4.1 Protein data sets 

We selected 142 protein structures from the PDB (Bernstein, Koetzle, Williams, Meyer, Brice, Rodgers, Kennard, Shimanouchi & Tasumi 1977), listed in Tab. 5, with lengths varying from 36 to 823, following criteria very similar to Maiorov & Crippen (1992). For each reference protein, we built a set of alternative conformations by threading its sequence on all the other structures in the set with a greater or equal number of amino acids. As explained in Jones et al. (1992), threading a sequence of $L$ amino acids on a structure, $\Gamma'$, of length $L' \ge L$, involves mounting the sequence on all the $(L' - L + 1)$ segments (of contiguous amino acids) taken from $\Gamma'$. This procedure assigns the contact map of the threaded segment to the threading sequence. The inherited contact map is used to calculate the energy of the sequence in the alternative, threaded, conformation, and compared with the energy in its native state, which is required to be the global energy minimum.
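The gapless threading step can be sketched as follows, assuming the contact map of the target structure is available as a square array; the generator name `threaded_maps` is hypothetical. Each placement simply inherits the diagonal sub-block of the target contact map.

```python
import numpy as np

def threaded_maps(delta_target, seq_len):
    """Yield (offset, sub-map) for each of the L' - L + 1 gapless placements."""
    l_target = delta_target.shape[0]
    for offset in range(l_target - seq_len + 1):
        end = offset + seq_len
        yield offset, delta_target[offset:end, offset:end]

# Illustrative target of length 10 and a sequence of length 7: 4 decoys.
target = np.arange(100.0).reshape(10, 10)
decoys = list(threaded_maps(target, 7))
print(len(decoys), decoys[0][1].shape)
```

Because a structure of equal length contributes exactly one placement and longer structures contribute more, the number of decoys per sequence grows as the sequence gets shorter relative to the structures in the set.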

Only single chain structures have been selected in order to avoid the occurrence of interchain contacts between amino acids, that are not detected by our procedure and that could cause the stabilization of hydrophobic residues on a protein's surface. Considering multiple chain structures would have introduced spurious effects in the extracted potentials, since inter-chain contacts would not be present in threaded conformations. For simplicity, however, we decided to retain proteins which may be found in polymeric forms. Because the presence of large hetero groups can distort the usual geometry of dihedral angles between amino acids and cannot be treated in a simple way by a pairwise potential, we discarded protein structures with high percentages of non-water HETATM records in their PDB files (like HEM or CPS groups).

We used the classification of 3-D protein structures SCOP (Murzin, Brenner, Hubbard & Chothia 1995) to select proteins spanning a wide range of different three dimensional folds: no pair of proteins in our training set belongs to the same SCOP family. Furthermore, we have included only proteins in the first 4 SCOP classes: all-α, all-β, α/β and α+β proteins. Cell membrane or surface proteins and very short peptide chains are excluded because usually they are not stabilized by just amino acid interactions but by some external factors, such as the hydrophobic environment, metal ligands, heme groups etc. No unresolved backbone atoms inside a chain are allowed; disordered or unresolved terminal backbone atoms are eliminated.

We also disregarded proteins that were not typically compact: because there is a clear dependence of the radius of gyration and the number of contacts among amino acids on chain length, we rejected from our training set proteins with too large a radius of gyration or with significantly fewer contacts than expected for their length. The rejection was based on the quantitative procedures discussed in Maiorov & Crippen (1992).



4.2 Optimal Stability Perceptron 

It is convenient to recast expression (7) so that the dependence on the interaction parameters between amino acid types $A$ and $A'$, $\epsilon(A,A')$, appears explicitly

$$\sum_{A \le A'} \epsilon(A,A')\, \frac{n_{S_i,\Gamma_D}(A,A') - n_{S_i,\Gamma_i}(A,A')}{f(D(\Gamma_D,\Gamma_i))} > c \tag{12}$$

where $n_{S,\Gamma}(A,A')$ is the number of contacts between amino acids $A$ and $A'$ attained by $S$ on $\Gamma$. The indices $A$ and $A'$ run over the 20 amino acid classes for parametrization (2). Expression (12) can be rewritten in a more compact form by mapping the independent entries of the $\epsilon$ matrix on a one-dimensional vector,

$$\boldsymbol{\epsilon} = \{\epsilon(1,1), \epsilon(1,2), \ldots, \epsilon(20,20)\} \tag{13}$$

and likewise for the vector

$$\mathbf{N}_{S,\Gamma,\Gamma'} = \left\{ \frac{n_{S,\Gamma'}(1,1) - n_{S,\Gamma}(1,1)}{f(D(\Gamma',\Gamma))}, \ldots \right\} . \tag{14}$$

With the former definitions, equation (12) becomes:

$$\boldsymbol{\epsilon} \cdot \mathbf{N}_{S_i,\Gamma_i,\Gamma_D} > c . \tag{15}$$

A formally equivalent expression is obtained when using 230 parameters as in eq. (3). Expression (15) leads to a geometrically appealing interpretation of the stability requirement. The optimal stability is reached when the interaction vector has the largest possible inner product with all the $\mathbf{N}_{S_i,\Gamma_i,\Gamma_D}$ vectors, also termed "patterns", originated from the training set. A rigorous solution for this geometrical problem was given by Krauth & Mezard (1987) who suggested an iterative procedure called optimal stability perceptron.
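The mapping from the symmetric contact matrix to a 210-component vector, and the construction of a pattern, can be sketched as below. This assumes integer amino acid types 0..19 and smooth contact strengths as weights; all names (`PAIR_INDEX`, `flatten_eps`, `contact_counts`, `pattern`) are illustrative and not taken from the original work.

```python
import numpy as np

N_TYPES = 20
# Index of each unordered pair (a <= b) in the flattened parameter vector.
PAIR_INDEX = {}
for a in range(N_TYPES):
    for b in range(a, N_TYPES):
        PAIR_INDEX[(a, b)] = len(PAIR_INDEX)

def flatten_eps(eps):
    """Map the independent entries of a symmetric 20 x 20 matrix to a vector."""
    return np.array([eps[a, b] for (a, b) in PAIR_INDEX])

def contact_counts(types, delta):
    """Accumulated contact strength for each amino acid pair type, pairs i > j + 1."""
    counts = np.zeros(len(PAIR_INDEX))
    for i in range(len(types)):
        for j in range(i - 1):
            a, b = sorted((int(types[i]), int(types[j])))
            counts[PAIR_INDEX[(a, b)]] += delta[i, j]
    return counts

def pattern(types, delta_native, delta_decoy, f_value=1.0):
    """Decoy minus native contact counts, divided by f(D), as in the pattern vector."""
    diff = contact_counts(types, delta_decoy) - contact_counts(types, delta_native)
    return diff / f_value

# Hypothetical data: the dot product reproduces the pairwise contact energy.
rng = np.random.default_rng(0)
types = rng.integers(0, N_TYPES, size=15)
delta = rng.random((15, 15))
delta = (delta + delta.T) / 2.0
eps = rng.normal(size=(N_TYPES, N_TYPES))
eps = (eps + eps.T) / 2.0
print(len(PAIR_INDEX), flatten_eps(eps) @ contact_counts(types, delta))
```

With this layout the energy difference between a decoy and the native state is a single dot product, which is what makes the inequalities linear in the unknown parameters.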

The procedure is the following. Starting from a random (or an otherwise assigned) set of interactions satisfying the norm constraint (5), the stability score of all inequalities is computed. Then, the potentials are updated so as to increase the stability of the lowest scoring inequality. This is done by adding to the original potentials vector, $\boldsymbol{\epsilon}$, a small term proportional to the worst scoring pattern (see Fig. ||):

$$\forall A, A' \qquad \epsilon(A,A') \to \epsilon(A,A') + \frac{1}{d}\, N_{S,\Gamma,\Gamma'}(A,A') , \tag{16}$$

where $d$ is the dimension of the parameter space (210 or 230). Then, the inequalities are re-computed with the updated interaction parameters. The lowest scoring one is identified again and a new update of $\boldsymbol{\epsilon}$ is carried out. Note that the update (16) will typically change the norm of $\boldsymbol{\epsilon}$. The unit norm (see eqn. (5)) can be conveniently enforced after convergence has been achieved.
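The iteration described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the synthetic patterns, the fixed iteration count and the choice to return the best iterate seen (rather than the last one) are assumptions made here for simplicity.

```python
import numpy as np

def optimal_stability_perceptron(patterns, n_iter=2000, seed=0):
    """Repeatedly reinforce the worst scoring pattern, as in update (16).

    Returns a unit-norm parameter vector and its smallest score over all patterns.
    """
    patterns = np.asarray(patterns, dtype=float)
    d = patterns.shape[1]
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=d)
    eps /= np.linalg.norm(eps)                 # start on the unit sphere
    best_eps, best_c = eps.copy(), -np.inf
    for _ in range(n_iter):
        scores = patterns @ eps
        c = scores.min() / np.linalg.norm(eps)  # stability of the current direction
        if c > best_c:                          # assumption: keep the best iterate
            best_eps, best_c = eps / np.linalg.norm(eps), c
        eps = eps + patterns[np.argmin(scores)] / d
    return best_eps, float(best_c)

# Synthetic, linearly separable patterns (hypothetical data).
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 10))
x[:, 0] = np.abs(x[:, 0]) + 0.5               # guarantees a solution along axis 0
eps_opt, c_opt = optimal_stability_perceptron(x)
print(c_opt)
```

Each iteration costs one matrix-vector product over all patterns, which is why the running time per step scales linearly with the number of inequalities.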

While convergence is guaranteed to be reached in a finite number of steps, the time required for each iteration grows linearly with the number of inequalities. In our case, we typically dealt with ≈ 10^6 inequalities, and convergence sometimes required several thousand iterations (each taking a few seconds of CPU time). Hence, we devised a scheme to speed up convergence based on the observation that the stability variation due to the change of parameters is proportional to the distance between the inequality point and the parameter direction,

$$\Delta s = \mathbf{N}\cdot\Delta\boldsymbol{\epsilon} = \mathbf{N}_{\perp}\cdot\Delta\boldsymbol{\epsilon}_{\perp} + \underbrace{\mathbf{N}_{\parallel}\cdot\Delta\boldsymbol{\epsilon}_{\parallel}}_{>0} \propto a\,|\mathbf{N}_{\perp}| + b , \qquad a \neq 0,\; b > 0 . \tag{17}$$

Here the subscripts $\perp$ and $\parallel$ denote the components perpendicular and parallel to the parameter direction.



This implies that inequality vectors far from the parameter direction will get the largest score variations (positive or negative) and so they are more likely to become the lowest scoring ones. Accordingly, the standard perceptron procedure was run until reaching a sub-optimal value for the stability threshold, c > 0; this typically needed 300 iterations. Then we temporarily restricted the updating procedure to those inequalities lying outside a cone with axis along the parameter direction and vertex at distance larger than c from the origin (see Fig. pi). The cone width was determined to limit the number of inequalities to less than 20000 (Fig. J3|). In this way we had to deal with about 10^4 inequalities, two orders of magnitude fewer than the original number, thus decreasing enormously the CPU time needed for optimization. Furthermore, after convergence, the neglected inequalities are found to be satisfied well above the optimal stability threshold $c_{max}$, thus justifying the numerical shortcut.

We conclude by remarking that, if the relative correction to $\epsilon(A,A')$ in eq. (16) is too large, this may result in a slowing down of the convergence. This difficulty can be readily circumvented by increasing the size of $\boldsymbol{\epsilon}$ by an order of magnitude. It was typically necessary to repeat this "inflation" procedure 3-4 times during each run towards convergence (see Fig. 0). This was sufficient to reach optimal convergence to the solution: $\Delta c / c < 10^{-3}$.
Table 1 (flattened in extraction; row and column labels of the parameter matrix are not recoverable). Surviving entries in source order, with surviving row labels in brackets:

0.0208, [n] 0.0202, -0.1610, -0.1207, 0.0528, 0.0892, -0.1136, 0.0646, -0.0228, 0.0233, 0.0624, 0.0358, 0.0047, -0.1300, -0.0895, 0.0580, -0.0257, -0.0469, 0.0288, 0.0214, 0.2016, -0.0236, -0.1070, 0.0085, 0.0671, 0.0733, -0.0618, -0.0831, -0.0511, -0.0687, 0.1258, 0.0623, 0.0087, -0.0576, 0.0138

[V] -0.0658, -0.0066, 0.0702, -0.0185, 0.0911, -0.0908, -0.0724, 0.0465, 0.0365, -0.0351, -0.0155, 0.0528, 0.0927, 0.0414, -0.0637, 0.0878, 0.0129, 0.0575, 0.0093, 0.0210, 0.0055, -0.0631, 0.0281, 0.0058, 0.0380, -0.0487, -0.0345, -0.0733, 0.0007, -0.0076, 0.0234, -0.0504, 0.0118, -0.0367, 0.0389, 0.0111, 0.0521, -0.1059, 0.0249, -0.0507, -0.0505, 0.0235, -0.0265, -0.0785, -0.0117, -0.0536, -0.0089, 0.1021, -0.0219, -0.0317, 0.0482, -0.0826, 0.1660

[Sol] -0.0053, -0.0850, 0.0737, 0.0575, -0.0901, 0.0387, -0.0201, -0.0478, 0.0317, -0.0448, 0.0164, 0.0186, 0.0801, 0.0121, 0.0141, 0.0202, 0.0006, -0.0462
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I 



Solvent
  f = 1          f = x          10.1°
  f = 1          f = x^2        13.5°
  f = x          f = x^2        6.67°
  f = 1          Non-optimal    (average) 54°
  f = x          Non-optimal    (average) 55°
  f = x^2        Non-optimal    (average) 55°
  Non-optimal    Non-optimal    (average) 67°

No-Solvent
  f = 1          f = x          8.5°
  f = 1          f = x^2        15.9°
  f = x          f = x^2        10.0°

Table 2: Angles formed by the optimal vectors for various forms of the Hamiltonian and f (see the
corresponding equations in the main text).
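
The angles quoted in Table 2 compare pairs of parameter vectors. A minimal sketch of such a comparison is given below, assuming each set of potentials is stored as a flat list of numbers with the same ordering of entries; the function name is illustrative and does not come from the original work.

```python
import math


def angle_degrees(u, v):
    """Return the angle, in degrees, between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))
```

Because the angle depends only on directions, an overall rescaling of either set of potentials leaves the result unchanged.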
Prot. code  Length  Native energy  No. decoys  No. better str.  Average difference  SCOP classification
1hbg        146     -23.259936     8007        0                0 ± 0               1001001001001 01)3
1mba        146     -22.890319     8007        0                0 ± 0               1001001001001 005
1mbs        151     -22.674067     7756        0                0 ± 0               1001001001001 006
Mil         152     -35.490600     7707        0                0 ± 0               1001001001001 014
211il>      149     -21.766920     7855        0                0 ± 0               1001001001001 033
1cty        108     -7.089541      10259       5                -57 ± 81            1001003001001 004
1yeb        108     -7.110300      10259       1†               -3 ± 0              1001003001001 004
1eer        111     -7.226756      10056       0                0 ± 0               1001003001001 006
2c2c        112     -8.985422      9989        2                -150 ± 170          1001003001001 009
351c        82      -4.752413      12127       3                17 ± 50             1001003001001 017
*1le4       139     -25.618122     8383        0                0 ± 0               1001023001001 003
2mhr        117     -21.248559     9670        0                0 ± 0               1001023004001 004
1rcb        129     -30.026804     8944        0                0 ± 0               1001025001002 002
*4tnc       160     3.959946       7332        0                0 ± 0               1001034001005 001
*1cll       143     9.773591       8166        0                0 ± 0               1001034001005 005
*1clm       144     9.950750       8112        1                -370 ±              1001034001005 011
1eca        291     -36.626670     2654        0                0 ± 0               1001065001001 003
*3wrp       101     -10.772739     10755       9                235 ± 90            1001078001001 001
1poc        134     -31.651214     8659        0                0 ± 0               1001095001001 001
2imm        114     -17.753696     9858        0                0 ± 0               1002001001001 024
2rhc        114     -14.557044     9858        0                0 ± 0               1002001001001 088
2stv        184     -27.148904     6318        0                0 ± 0               1002008001002 002
2cna        23?     -30.109204     4302        0                0 ± 0               1002019001001 001
1lec        242     -60.150421     4129        0                0 ± 0               1002019001001 004
1ite        239     -36.671976     4231        0                0 ± 0               1002019001001 005
1shg        57      -14.578324     14051       0                0 ± 0               1002021002001 006
8adh        374     -82.544846     1113        0                0 ± 0               1002022001002 001
1gbt        223     -43.532366     4821        0                0 ± 0               1002031001002 001
1est        240     -63.644775     4196        0                0 ± 0               1002031001002 013
4apc        330     -78.686637     1763        0                0 ± 0               1002034001002 001
3app        323     -72.409291     1897        0                0 ± 0               1002034001002 002
2apr        325     -64.356790     1856        0                0 ± 0               1002034001002 003
3pep        326     -72.156108     1836        0                0 ± 0               1002034001002 006
1mpp        356     -87.210810     1356        0                0 ± 0               1002034001002 009
1ems        321     -80.466445     1940        0                0 ± 0               1002034001002 011
1brp        173     -30.565982     6770        0                0 ± 0               1002041001001 002
1ump        157     -30.854982     7471        0                0 ± 0               1002041001001 008
2aaa        475     -92.682878     340         1†               37 ± 0              1002048001001 008
6taa        476     -93.727839     335         0                0 ± 0               1002048001001 009
1btc        490     -96.939667     292         0                0 ± 0               1003001001002 001
1aid        363     -67.400922     1257        0                0 ± 0               1003001003001 002
3enl        436     -52.025765     553         0                0 ± 0               1003001006001 001
1pii        452     -138.741679    456         0                0 ± 0               1003001008001 001
1xis        385     -37.102765     980         0                0 ± 0               1003001012001 004
lplih       394     -85.545211     880         0                0 ± 0               1003004001002 002
1gal        580     -91.922051     111         0                0 ± 0               1003004001002 004
lclhr       236     -44.849628     4339        0                0 ± 0               1003019001002 006
2cmd        312     -83.791376     2147        0                0 ± 0               1003019001005 002
11dm        329     -73.012055     1781        0                0 ± 0               1003019001005 008
1gky        186     -30.194166     6237        0                0 ± 0               1003025001001 001
3adk        194     -28.274731     5924        0                0 ± 0               1003025001001 006
121p        166     -37.077357     7067        1†               84 ± 0              1003025001003 001
4q21        167     -40.978706     7023        0                0 ± 0               1003025001003 001
1sbc        274     -56.392841     3113        0                0 ± 0               1003028001001 001
ltlini      279     -51.381279     2968        0                0 ± 0               1003028001001 003
1s01        275     -47.482543     3082        0                0 ± 0               1003028001001 006
1s02        275     -42.291256     3082        1†               3 ± 0               1003028001001 006
2prk        279     -50.558910     2968        0                0 ± 0               1003028001001 007
1ama        401     -83.687554     813         0                0 ± 0               1003048001001 001
1spa        396     -78.338515     859         0                0 ± 0               1003048001001 004
1ipd        345     -51.476690     1522        0                0 ± 0               1003057001001 001
3icd        414     -55.210210     708         0                0 ± 0               1003057001001 003
lrhxl       292     -59.479630     2628        0                0 ± 0               1003060001001 001
3p£k        319     -60.742159     1985        0                0 ± 0               1003070001001 002
1ovb        159     -37.670779     7378        0                0 ± 0               1003073001002 002
1lfg        690     -160.321444    -           0                0 ± 0               1003073001002 005
1321        129     -32.386126     8944        1†               14 ±                1004002001002 001
1lz3        129     -32.719245     8944        0                0 ± 0               1004002001002 002
1laa        130     -40.734890     8884        0                0 ± 0               1004002001002 008
1aic        121     -36.611850     9425        0                0 ± 0               1004002001002 013
3118        68      -18.129049     13192       0                0 ± 0               1004007001001 001
1fkb        106     -14.625747     10399       0                0 ± 0               1004019001001 001
1yat        113     -13.260539     9923        0                0 ± 0               1004019001001 003
1ctf        68      -11.121616     13192       0                0 ± 0               1004026001001 001
Ud2         106     -27.581742     10399       0                0 ± 0               1004033001002 001
2fxb        81      -7.781913      12202       0                0 ± 0               1004033001004 003
3tms        264     -60.007310     3424        0                0 ± 0               1004063001001 001
3b5c        84      -12.241707     11980       0                0 ± 0               1004066001001 001

Table 3: Proteins used in the test set. The symbol † denotes instances where the better-scoring structure
is homologous to the target protein, while a * marks a non-compact native state. 1lfg has been used only as
a structural template.
Score on Test
                 Not Rec. Str.   Unsat. Ineq.
Solvent
  f = 1          5               25
  f = x          5               33
  f = x^2        5               36
  Non-Optimal    5.25            75.7
No Solvent
  f = 1          6               118
  f = x          6               200
  f = x^2        6               254
DBMS             7               51
KGS              6               452
MC               6               48

Score on Training
DBMS             13              1091
KGS              22              9789
MC               20              1826

Table 4: Performance of the potentials extracted in this work and of other known sets. The second column
gives the number of unrecognised native states among the 78 proteins of Table 3. The associated number of
violated inequalities (out of a total of 444199) is given in column 3. The acronyms for the alternative
potentials refer to: DBMS = (Dima et al. 1999), KGS = (Kolinsky et al. 1993), MC = (Maiorov & Crippen
1994). The last part of the table shows the scores of the alternative potentials applied to our training set of
142 proteins, generating 1551196 inequalities (on which our potentials, by definition, score 100% success).
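
The two scores reported in Table 4 can be illustrated with a short sketch. It assumes that each inequality is stored as a pair (protein code, vector of decoy-minus-native contact counts) and that an inequality is satisfied when its inner product with the potential is positive; this storage format and sign convention are assumptions made for illustration, not details taken from the original work. The sketch also makes no attempt to set aside homologous decoys, which Table 3 flags separately.

```python
def score_potential(potential, inequalities):
    """Return (unrecognised native states, violated inequalities).

    Each inequality is a pair (protein_code, difference), where `difference`
    holds decoy-minus-native contact counts (an assumed sign convention), so
    the inequality is satisfied when potential . difference > 0.
    """
    violated = 0
    unrecognised = set()
    for protein_code, difference in inequalities:
        energy_gap = sum(p * d for p, d in zip(potential, difference))
        if energy_gap <= 0.0:
            # The decoy scores at least as well as the native state.
            violated += 1
            unrecognised.add(protein_code)
    return len(unrecognised), violated
```

A protein counts as unrecognised as soon as one of its decoys violates the inequality, which is why the second column of the table can stay small while the third grows.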
Prot. code  Length  SCOP classification     No. decoys
1vii        36      1001014001001 001       25461
1ord        40      1001010001001 002       24896
1pru        56      1001030001003 001       22655
1fxd        58      1004033001001 001       22376
1igd        61      1004012001001 001       21961
1ore        64      1001030001002 005       21549
1sap        66      1004009001001 002       21276
1mit        67      1004022001001 003       21140
1utg        69      1001072001001 001       20871
1ail        70      1001015001001 001       20737
1hoe        74      1002004001001 001       20208
1kjs        74      1001040001001 001       20208
1ubi        74      1004012002001 001       20208
1hyp        75      1001042001001 001       20076
5icb        75      1001034001001 001       20076
1fow        76      1001004004001 001       19947
1tif        76      1004012006001 001       19947
1tnt        76      1001006001001 001       19947
1acp        77      1001026001001 001       19820
1hdj        77      1001002002001 001       19820
1iba        77      1004053001001 001       19820
1vcc        77      1004067001001 001       19820
1coo        81      1001032001001 001       19336
1cei        84      1001026002001 001       18978
1ngr        84      1001062001001 001       18978
1opd        85      1004052001001 003       18859
1fna        90      1002001002001 002       18278
1hqi        90      1004079001001 001       18278
1who        94      1002006003001 001       17820
1pdr        96      1002023001001 001       17593
1beo        98      1001096001001 001       17368
1tul        101     1002060004001 001       17034
9rnt        104     1004001001001 003       16703
1aac        105     1002005001001 001       16593
1erv        105     1003033001001 004       16593
1jpc        108     1002054001001 001       16270
1kum        108     1002003001001 005       16270
1rro        108     1001034001004 001       16270
3ssi        108     1004044001001 002       16270
2mcm        112     1002001006001 001       15854
1mai        118     1002037001001 001       15241
1poa        118     1001095001002 001       15241
1whi        122     1002025001001 001       14839
1yua        122     1004067001002 001       14839
7rsa        124     1004004001001 001       14641
2phy        125     1004061002001 001       14543
1bfg        126     1002028001001 001       14446
3chy        128     1003013002001 001       14255
1pdo        129     1003040001001 001       14160
1tum        129     1004062001001 001       14160
1ife        131     1002041001002 002       13974
1kuh        131     1004050001001 001       13974
His         131     1001017001001 001       13974
1rsy        132     1002006001002 001       13882
1cof        135     1004060001002 001       13617
2end        137     1001016001001 001       13442
5nul        138     1003013004001 006       13355
2sns        140     1002026001001 001       13184
1lcl        141     1002019001003 004       13099
1lba        145     1004064001001 001       12766
1pkp        145     1004011001001 002       12766
1vsd        145     1003041003002 001       12766
1npk        150     1004033006001 002       12363
1irp        153     1002028001002 003       12125
2rn2        155     1003041003001 001       11968
1vhh        157     1004034001002 001       11813
1gpr        158     1002059003001 001       11736
1ra9        159     1003053001001 001       11660
1191        162     1004002001003 001       11437
2cpl        164     1002043001001 001       11290
1sfe        165     1001004002001 001       11217
1wba        171     1002028003001 001       10790
2fha        171     1001024001001 003       10790
1amm        173     1002009001001 001       10650
2prd        173     1002026005001 003       10650
1ido        184     1003045001001 002       9911
1531        185     1004002001004 001       9844
1xnb        185     1002019001008 001       9844
1knb        186     1002016001001 001       9778
1kid        192     1003005003001 001       9399
1cex        197     1003013007001 001       9088
1chd        198     1003027001001 001       9026
1fua        206     1003055001001 001       8545
1thv        207     1002018001001 001       8485
2abk        211     1001066001001 001       8252
1ah6        213     1004068001001 001       8137
1lbu        213     1001019001001 001       8137
3cla        213     1003030001001 001       8137
2ayh        214     1002019001002 002       8080
1gpc        217     1002026004007 003       7920
1akz        223     1003011001001 001       7607
1dad        224     1003025001005 001       7555
1aol        227     1002015001001 001       7404
1cby        227     1004058001001 001       7404
1lbd        238     1001087001001 001       6874
2baa        243     1004002001001 001       6638
1mrj        247     1004094001001 001       6453
3fib        248     1004098001001 001       6407
1plq        258     1004076001002 001       5966
2cba        258     1002050001001 002       5966
1arb        262     1002031001001 001       5796
1ako        268     1004086001001 001       5549
2dri        271     1003072001001 001       5428
1tml        286     1003002001001 001       4842
1han        287     1004020001003 002       4803
1nar        289     1003001001005 002       4728
1amp        290     1003052003004 001       4691
1ctt        294     1003075001001 001       4550
2ctc        307     1003052003001 001       4107
1ede        310     1003050001003 001       4007
1pgs        311     1002011001001 001       3974
1ads        315     1003001005001 002       3849
1hyt        316     1001053001001 002       3818
1tca        317     1003050001007 001       3788
1pot        321     1003073001001 011       3675
1axn        323     1001051001001 001       3620
1dxy        329     1003013009001 002       3463
1nif        333     1002005001003 001       3362
1rpa        341     1003043001002 001       3169
1uby        348     1001091001001 001       3007
1idk        359     1002056001002 001       2764
1eur        360     1002045001001 004       2742
1cem        363     1001073001002 001       2681
1pud        372     1003001017001 001       2509
1kaz        377     1003041001001 001       2418
1edg        380     1003001001003 002       2366
1php        394     1003066001001 003       2141
1phc        405     1001075001001 001       1975
1uae        417     1004035002001 001       1806
1gnd        430     1003004001003 001       1636
1csh        433     1001074001001 001       1599
1pmi        440     1002058002001 001       1521
1gcb        452     1004003001001 008       1400
2bnh        456     1003007001001 001       1363
3grs        461     1003004001004 001       1322
1gai        471     1001073001001 001       1251
11am        484     1003036001001 001       1172
1vnc        576     1001080001001 001       711
1ciy        577     1002013001002 002       706
1amj        753     1003005002001 002       177
1gpb        823     1003068001002 001       36
1qba*       858     1002001001005 002       -

Table 5: List of "training proteins" used to extract interaction potentials.

* The longest protein in the set, 1qba, was used only as a structural template.
Figure 1: The extracted solvation parameters, e(0, A) (see eqn. (0)) versus standard hydrophobicity values
(Creighton 1993, p. 154).
Figure 2: Schematic representation of a typical perceptron update in a two-dimensional space. Inequalities
are represented by vectors connecting the origin to the points. At iteration n, the stability, c, of ineq. (0)
is the smallest inner product between the parameter vector e_n and each of the inequalities. In case (a), c
is given by e · N, and e_n accordingly acquires a small component parallel to N. An equilibrium situation is
shown in (b). Successive updates cause e to bounce on either side of the equilibrium direction. The latter
is reached in a finite time, because the relative size of the added component decreases with the number of
iterations.
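
The update described in the caption of Figure 2 can be sketched in a few lines. The version below is a generic minimum-overlap perceptron written under stated assumptions: the inequalities are plain lists of numbers, the stability is normalised by the length of e, and the starting vector is simply the first inequality. None of these choices is confirmed by the original text, and the loop only makes sense when the inequalities can all be satisfied simultaneously.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def stability(e, inequalities):
    """Smallest inner product between the unit vector along e and any inequality."""
    return min(dot(e, v) for v in inequalities) / math.sqrt(dot(e, e))


def train(inequalities, iterations):
    """Repeatedly add the worst-satisfied inequality to the parameter vector."""
    e = list(inequalities[0])  # assumed non-zero starting point
    for _ in range(iterations):
        worst = min(inequalities, key=lambda v: dot(e, v))
        # The added vector has fixed size while e keeps growing, so the
        # relative size of each correction shrinks with the iteration count.
        e = [a + b for a, b in zip(e, worst)]
    return e
```

For two orthogonal unit inequalities the vector alternates between the two sides of the diagonal, which is the bouncing behaviour sketched in panel (b), and its stability approaches 1/sqrt(2).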
Figure 3: After a sufficient number of iterations, successive perceptron updates will not change e
appreciably. To speed up convergence, it is convenient to temporarily retain only those inequalities
(points in parameter space) lying outside a cone with axis along e* and suitable vertex and width (see
text). The edge of the cone is visible in this figure, where only the retained inequalities are shown.
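
The selection described in the caption of Figure 3 amounts to a geometric test on each inequality. The sketch below is one possible reading: the vertex is placed on the axis at a distance `vertex_offset` from the origin and the width is given as a half-angle in degrees. Both parameters, and the function names, are hypothetical placeholders, since the caption defers the actual choice of vertex and width to the main text.

```python
import math


def outside_cone(point, axis, vertex_offset, half_angle_deg):
    """True if `point` lies outside the cone opening along `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    # Position of the point relative to the cone vertex on the axis.
    rel = [p - vertex_offset * u for p, u in zip(point, unit)]
    rel_norm = math.sqrt(sum(r * r for r in rel))
    if rel_norm == 0.0:
        return False  # the vertex itself is treated as inside
    cos_angle = sum(r * u for r, u in zip(rel, unit)) / rel_norm
    return cos_angle < math.cos(math.radians(half_angle_deg))


def retain_outside(points, axis, vertex_offset, half_angle_deg):
    """Keep only the points outside the cone, preserving their order."""
    return [p for p in points
            if outside_cone(p, axis, vertex_offset, half_angle_deg)]
```

Points well inside the cone have a large component along the axis and are comfortably satisfied, so dropping them temporarily reduces the work per update; the full set would still need to be re-checked before accepting a solution.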




Figure 4: Perceptron stability as a function of the number of iterations. The discontinuities are
associated with the "inflation" of e used to speed up convergence (see text).



Figure 5: Energies of the protein sequence 1lz3 when threaded on decoy structures, plotted against
structural dissimilarity. Very low energies are observed when threading on homologous conformations (in
particular protein 1321).
Figure 6: When using the extracted parameters of Table 1, the native state energy of proteins shows an
approximately linear behaviour as a function of their length. Points refer to proteins from both the
training and test sets. Proteins with fewer than 200 amino acids and atypical compactness present
significant deviations from the average trend.
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